use_fit_start_date <- as.Date("2020-01-01")
use_fit_end_date <- as.Date("2022-01-01")
use_valid_start_date <- as.Date("2022-01-01")
use_valid_end_date <- as.Date("2023-01-01")

In this workbook we introduce the various BTYD models, starting with a discussion of the underlying theory.
Before we start working on fitting and using the various Buy-Till-You-Die models, we first need to discuss the basic underlying theory and model.
In this model, we assume a customer becomes ‘alive’ to the business at their first purchase, makes purchases stochastically at a steady rate for a period of time, and then ‘dies’ - that is, becomes inactive to the business - hence the name “Buy-Till-You-Die”.
Thus, at a high level these models decompose into modelling the transaction events using distributions such as the Poisson or Negative Binomial, and then modelling the ‘dropout’ process using some other method.
A number of BTYD models exist, and for this workshop we will focus on the BG/NBD model - the Beta-Geometric/Negative Binomial Distribution model - though we will also discuss the Pareto/NBD (P/NBD) model.
These models require only two pieces of information about each customer’s purchasing history: the “recency” (when the last transaction occurred) and “frequency” (the count of transactions made by that customer in a specified time period).
The notation used to represent this information is
\[ X = (x, \, t_x, \, T), \] where
\[ \begin{eqnarray*} x &=& \text{the number of repeat transactions (after the first)}, \\ T &=& \text{the observed time period}, \\ t_x &=& \text{the time of the last transaction}. \end{eqnarray*} \]
From this summary data we can fit most BTYD models.
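To make this concrete, here is a short sketch of how these summaries can be computed from a raw transaction log. This is illustrative only - the toy table and the day-based time unit are assumptions for the example, not the workbook's own data.

```r
library(dplyr)

# Toy transaction log: two customers, timestamps in days (hypothetical data)
transactions_tbl <- tibble(
  customer_id   = c("C1", "C1", "C1", "C2"),
  tnx_timestamp = c(0, 10, 25, 0)
)

obs_end <- 100  # end of the observation window, in days

# For each customer: x = repeat transactions after the first,
# t_x = time of the last transaction (recency), T = total observed time
summary_tbl <- transactions_tbl |>
  group_by(customer_id) |>
  summarise(
    x     = n() - 1,
    t_x   = max(tnx_timestamp) - min(tnx_timestamp),
    T_cal = obs_end - min(tnx_timestamp),
    .groups = "drop"
  )

summary_tbl
```

Note that the individual transaction times are discarded: only the \((x, t_x, T)\) triple per customer is retained.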
There are a number of statistical approaches to building BTYD models, each relying on different assumptions about how the recency, frequency and monetary values are modelled. We now discuss some of these approaches.
The P/NBD model relies on five assumptions:

1. While active, a customer makes purchases according to a Poisson process with rate \(\lambda\).
2. The purchase rate \(\lambda\) varies across customers according to a Gamma distribution.
3. Each customer remains active for an unobserved lifetime \(\tau\), which is Exponentially distributed with dropout rate \(\mu\).
4. The dropout rate \(\mu\) varies across customers according to a Gamma distribution.
5. The purchase rate \(\lambda\) and the dropout rate \(\mu\) are independent across customers.
As before, we express this in mathematical notation as:
\[ \begin{eqnarray*} \lambda &\sim& \Gamma(r, \alpha), \\ \mu &\sim& \Gamma(s, \beta), \\ \tau &\sim& \text{Exponential}(\mu) \end{eqnarray*} \]
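To build intuition for this generative process, we can simulate a single customer: draw \(\lambda\) and \(\mu\) from their Gamma distributions, draw the lifetime \(\tau\), and generate Poisson arrivals up to \(\min(\tau, T)\). This is a hedged sketch - the hyperparameter values below are arbitrary illustrations, not the priors used later in the workbook.

```r
set.seed(42)

r <- 1; alpha <- 4      # lambda ~ Gamma(r, alpha), mean r / alpha = 0.25
s <- 1; beta  <- 10     # mu ~ Gamma(s, beta), mean s / beta = 0.10
T_obs <- 104            # observation window (e.g. weeks)

lambda <- rgamma(1, shape = r, rate = alpha)   # individual purchase rate
mu     <- rgamma(1, shape = s, rate = beta)    # individual dropout rate
tau    <- rexp(1, rate = mu)                   # unobserved lifetime

# Poisson process: exponential inter-arrival times, truncated at min(tau, T)
active_until <- min(tau, T_obs)
arrivals     <- cumsum(rexp(1000, rate = lambda))
tnx_times    <- arrivals[arrivals <= active_until]

tnx_times
```

Every simulated transaction falls before the customer either ‘dies’ at \(\tau\) or the observation window ends at \(T\), whichever comes first.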
This model relies on a number of base assumptions, somewhat similar to the P/NBD model but modelling the lifetime with a different process:

1. While active, a customer makes purchases according to a Poisson process with rate \(\lambda\).
2. The purchase rate \(\lambda\) varies across customers according to a Gamma distribution.
3. Immediately after each purchase, the customer becomes inactive with probability \(p\).
4. The dropout probability \(p\) varies across customers according to a Beta distribution.
5. The purchase rate \(\lambda\) and the dropout probability \(p\) are independent across customers.
Note that it follows from the above assumptions that the probability of a customer being ‘alive’ after any transaction is given by the Geometric distribution, and hence the Beta-Geometric in the name.
To put this into more formal mathematical notation, we have:
\[ \begin{eqnarray*} \lambda &\sim& \Gamma(r, \alpha), \\ P(\text{alive after } k \text{ transactions}) &=& (1 - p)^k, \\ p &\sim& \text{Beta}(a, b) \end{eqnarray*} \]
We start by modelling the P/NBD model using our synthetic datasets before we try to model real-life data.
use_fit_start_date <- as.Date("2020-01-01")
use_fit_end_date <- as.Date("2022-01-01")
use_valid_start_date <- as.Date("2022-01-01")
use_valid_end_date <- as.Date("2023-01-01")

We now want to load the short time-frame synthetic data.
customer_cohortdata_tbl <- read_rds("data/synthdata_shortframe_cohort_tbl.rds")
customer_cohortdata_tbl |> glimpse()

Rows: 50,000
Columns: 4
$ customer_id <chr> "SFC202001_0001", "SFC202001_0002", "SFC202001_0003", "…
$ cohort_qtr <chr> "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", …
$ cohort_ym <chr> "2020 01", "2020 01", "2020 01", "2020 01", "2020 01", …
$ first_tnx_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-0…
customer_simparams_tbl <- read_rds("data/synthdata_shortframe_simparams_tbl.rds")
customer_simparams_tbl |> glimpse()

Rows: 50,000
Columns: 9
$ customer_id <chr> "SFC202001_0001", "SFC202001_0002", "SFC202001_0003", …
$ cohort_qtr <chr> "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1",…
$ cohort_ym <chr> "2020 01", "2020 01", "2020 01", "2020 01", "2020 01",…
$ first_tnx_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-…
$ customer_lambda <dbl> 0.12213805, 0.29987747, 0.31504009, 0.03856001, 0.1881…
$ customer_mu <dbl> 0.127118566, 0.096184402, 0.052334526, 0.204708842, 0.…
$ customer_tau <dbl> 29.5956480, 10.9437199, 1.6938450, 2.6798108, 48.75206…
$ customer_amtmn <dbl> 46.2371662, 111.0425353, 45.8870891, 43.0754249, 10.93…
$ customer_amtcv <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
customer_transactions_tbl <- read_rds("data/synthdata_shortframe_transactions_tbl.rds")
customer_transactions_tbl |> glimpse()

Rows: 296,756
Columns: 4
$ customer_id <chr> "SFC202001_0021", "SFC202001_0007", "SFC202001_0008", "S…
$ tnx_timestamp <dttm> 2020-01-01 00:34:40, 2020-01-01 02:41:45, 2020-01-01 04…
$ invoice_id <chr> "T20200101-0001", "T20200101-0002", "T20200101-0003", "T…
$ tnx_amount <dbl> 20.60, 1.83, 0.07, 30.03, 112.09, 45.85, 71.83, 166.33, …
customer_fit_stats_tbl <- read_rds("data/shortsynth_fit_1000_data_tbl.rds")
customer_fit_stats_tbl |> glimpse()

Rows: 1,000
Columns: 6
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "…
$ first_tnx_date <dttm> 2020-01-01 10:27:01, 2020-01-01 11:35:35, 2020-01-03 1…
$ last_tnx_date <dttm> 2021-12-30 03:24:56, 2020-02-24 23:10:44, 2020-01-03 1…
$ x <dbl> 80, 4, 0, 2, 7, 1, 6, 4, 2, 58, 0, 1, 15, 8, 37, 3, 0, …
$ t_x <dbl> 104.1009825, 7.7832485, 0.0000000, 1.6433938, 47.262213…
$ T_cal <dbl> 104.3664, 104.3596, 104.0277, 104.0590, 104.0572, 103.8…
customer_valid_stats_tbl <- read_rds("data/shortsynth_obs_validdata_tbl.rds")
customer_valid_stats_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094"…
$ tnx_count <dbl> 36, 0, 0, 0, 0, 0, 0, 0, 0, 26, 0, 0, 6, 0, 15, 0, 0…
$ tnx_last_interval <dbl> 51.45780, NA, NA, NA, NA, NA, NA, NA, NA, 50.27866, …
We re-produce the visualisation of the transaction times we used in previous workbooks.
plot_tbl <- customer_transactions_tbl |>
group_nest(customer_id, .key = "cust_data") |>
filter(map_int(cust_data, nrow) > 3) |>
slice_sample(n = 30) |>
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))

We now turn our attention to deriving the log-likelihood for the P/NBD model.
We assume that we know that a given customer has made \(x\) transactions after the initial one over an observed period of time \(T\), and we label these transactions \(t_1\), \(t_2\), …, \(t_x\).
To model the likelihood for this observation, we need to consider two possibilities: one for where the customer is still ‘alive’ at \(T\), and one where the customer has ‘died’ by \(T\).
In the first instance, the likelihood is the product of the observations of each transaction, multiplied by the likelihood of the customer still being alive at time \(T\).
Because we are modelling the transaction counts as a Poisson process, this corresponds to the times between events following an exponential distribution, and so both the transaction times and the lifetime likelihoods use the exponential.
This gives us:
\[ \begin{eqnarray*} L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) &=& \lambda e^{-\lambda t_1} \, \lambda e^{-\lambda(t_2 - t_1)} \cdots \lambda e^{-\lambda (t_x - t_{x-1})} \, e^{-\lambda(T - t_x)} \\ &=& \lambda^x e^{-\lambda T} \end{eqnarray*} \]
and we can combine this with the likelihood of the lifetime of the customer \(\tau\) being greater than the observation window \(T\),
\[ P(\tau > T \, | \, \mu) = e^{-\mu T} \]
For the second case, the customer becomes inactive at some time \(\tau\) in the interval \((t_x, T]\), and so the likelihood is
\[ \begin{eqnarray*} L(\lambda \, | \, t_1, t_2, ..., t_x, \, \tau \in (t_x, T]) &=& \lambda e^{-\lambda t_1} \, \lambda e^{-\lambda(t_2 - t_1)} \cdots \lambda e^{-\lambda (t_x - t_{x-1})} \, e^{-\lambda(\tau - t_x)} \\ &=& \lambda^x e^{-\lambda \tau} \end{eqnarray*} \]
In both cases we do not need the times of the individual transactions, and all we need are the values \((x, t_x, T)\).
As we cannot observe \(\tau\), we want to remove the conditioning on \(\tau\) by integrating it out.
\[ \begin{eqnarray*} L(\lambda, \mu \, | \, x, t_x, T) &=& L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) \, P(\tau > T \, | \, \mu) \; + \\ & & \; \;\; \; \; \;\; \; \int^T_{t_x} L(\lambda \, | \, t_1, t_2, ..., t_x, \, \tau) \, f(\tau \, | \, \mu) \, d\tau \\ &=& \lambda^x e^{-\lambda T} e^{-\mu T} + \lambda^x \int^T_{t_x} e^{-\lambda \tau} \mu e^{-\mu \tau} d\tau \\ &=& \lambda^x e^{-(\lambda + \mu)T} + \frac{\lambda^x \mu}{\lambda + \mu} \left( e^{-(\lambda + \mu) t_x} - e^{-(\lambda + \mu) T} \right) \\ &=& \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} + \frac{\lambda^{x+1}}{\lambda + \mu} e^{-(\lambda + \mu) T} \end{eqnarray*} \]
In Stan, we work not with the likelihood but with the log-likelihood, so we need to take the log of this expression. This creates a problem, as we have no easy way to calculate \(\log(a + b)\). As this expression occurs a lot, Stan provides a function, log_sum_exp(), which is defined by
\[ \ln (a + b) = \text{LogSumExp}(\ln a, \, \ln b) \]
We then use this log_sum_exp() function to calculate the log-likelihood for the model. Note we use “LogSumExp” here to get around limitations in the renderer.
\[ \begin{eqnarray*} LL(\lambda, \mu \, | \, x, t_x, T) &=& \log \left( \frac{\lambda^x}{\lambda + \mu} \left( \mu \, e^{-(\lambda + \mu) t_x} + \lambda \, e^{-(\lambda + \mu) T} \right) \right) \\ &=& x \log \lambda - \log(\lambda + \mu) \; + \\ & & \;\;\;\;\;\;\; \text{LogSumExp}(\log \mu - (\lambda + \mu) \, t_x, \; \log \lambda - (\lambda + \mu) \, T) \end{eqnarray*} \]
This is the log-likelihood model we want to fit in Stan.
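As a sanity check on the derivation, we can implement the log-likelihood in R with a hand-rolled log-sum-exp and confirm it agrees with the likelihood computed by direct numerical integration over \(\tau\). This is a standalone sketch with arbitrary test values, separate from the Stan implementation below.

```r
# Numerically stable log(exp(a) + exp(b)), mirroring Stan's log_sum_exp()
log_sum_exp <- function(a, b) {
  m <- max(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Closed-form P/NBD log-likelihood for one customer summary (x, t_x, T)
pnbd_loglik <- function(lambda, mu, x, t_x, T_cal) {
  x * log(lambda) - log(lambda + mu) +
    log_sum_exp(log(mu) - (lambda + mu) * t_x,
                log(lambda) - (lambda + mu) * T_cal)
}

# The same likelihood via direct numerical integration over the dropout time
pnbd_lik_numeric <- function(lambda, mu, x, t_x, T_cal) {
  alive <- lambda^x * exp(-(lambda + mu) * T_cal)
  dead  <- integrate(
    function(tau) lambda^x * exp(-lambda * tau) * mu * exp(-mu * tau),
    lower = t_x, upper = T_cal, rel.tol = 1e-12
  )$value
  alive + dead
}

ll_closed  <- pnbd_loglik(0.5, 0.1, x = 7, t_x = 10, T_cal = 20)
ll_numeric <- log(pnbd_lik_numeric(0.5, 0.1, x = 7, t_x = 10, T_cal = 20))
abs(ll_closed - ll_numeric)
```

The two values agree to numerical precision, which gives us some confidence in the closed form before we commit it to Stan code.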
syslog(
glue("Creating the first P/NBD model"),
level = "INFO"
)

We now construct our Stan model and prepare to fit it with our synthetic dataset.
Before we start on that, we set a few parameters for the workbook to organise our Stan code.
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"

We also want to set a number of overall parameters for this workbook.
To fit the model, we use the sample of 1,000 customers. We also need to calculate the summary statistics for the validation period.
We start with the Stan model.
functions {
#include util_functions.stan
}
data {
int<lower=1> n; // number of customers
vector<lower=0>[n] t_x; // time to most recent purchase
vector<lower=0>[n] T_cal; // total observation time
vector<lower=0>[n] x; // number of purchases observed
real<lower=0> lambda_mn; // prior mean for lambda
real<lower=0> lambda_cv; // prior cv for lambda
real<lower=0> mu_mn; // prior mean for mu
real<lower=0> mu_cv; // prior cv for mu
}
transformed data {
real<lower=0> r = 1 / (lambda_cv * lambda_cv);
real<lower=0> alpha = 1 / (lambda_cv * lambda_cv * lambda_mn);
real<lower=0> s = 1 / (mu_cv * mu_cv);
real<lower=0> beta = 1 / (mu_cv * mu_cv * mu_mn);
}
parameters {
vector<lower=0>[n] lambda; // purchase rate
vector<lower=0>[n] mu; // lifetime dropout rate
}
model {
// setting priors
lambda ~ gamma(r, alpha);
mu ~ gamma(s, beta);
target += calculate_pnbd_loglik(n, lambda, mu, x, t_x, T_cal);
}
generated quantities {
vector[n] p_alive; // Probability that they are still "alive"
p_alive = 1 ./ (1 + mu ./ (mu + lambda) .* (exp((lambda + mu) .* (T_cal - t_x)) - 1));
}
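The transformed data block converts the prior mean and coefficient of variation into Gamma shape and rate parameters. The same conversion can be expressed in R - a sketch in the spirit of the gamma_mucv2shaperate helper referenced later, though the function name here is our own - and checked against the Gamma moment formulas.

```r
# Convert a (mean, cv) Gamma parameterisation into (shape, rate), as in
# the transformed data block: shape = 1 / cv^2, rate = 1 / (cv^2 * mean)
gamma_mucv_to_shaperate <- function(mn, cv) {
  c(shape = 1 / cv^2, rate = 1 / (cv^2 * mn))
}

params <- gamma_mucv_to_shaperate(mn = 0.25, cv = 1.00)

unname(params["shape"] / params["rate"])  # Gamma mean = shape / rate = 0.25
unname(1 / sqrt(params["shape"]))         # Gamma cv = 1 / sqrt(shape) = 1
```

Parameterising the priors by mean and coefficient of variation makes them easier to reason about than raw shape/rate values.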
This file contains a few new features of Stan - file includes and user-defined functions such as calculate_pnbd_loglik. We look at the included file here:
real calculate_pnbd_loglik(int n, vector lambda, vector mu,
data vector x, data vector t_x, data vector T_cal) {
// likelihood
vector[n] t1;
vector[n] t2;
vector[n] lpm;
vector[n] lht;
vector[n] rht;
lpm = lambda + mu;
lht = log(lambda) - lpm .* T_cal;
rht = log(mu) - lpm .* t_x;
t1 = x .* log(lambda) - log(lpm);
for (i in 1:n) {
t2[i] = log_sum_exp(lht[i], rht[i]);
}
return(sum(t1) + sum(t2));
}
real calculate_bgnbd_loglik(int n, vector lambda, vector p,
data vector x, data vector t_x, data vector T_cal) {
// likelihood
vector[n] t1;
vector[n] t2;
vector[n] lht;
vector[n] rht;
lht = log(p) + (x-1) .* log(1-p) + x .* log(lambda) - lambda .* t_x;
rht = x .* log(1-p) + x .* log(lambda) - lambda .* T_cal;
for(i in 1:n) {
t2[i] = log_sum_exp(lht[i], rht[i]);
}
return(sum(t2));
}
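The p_alive expression in the generated quantities block follows directly from the likelihood derivation: the probability of being alive at \(T\) is the ‘alive’ component of the likelihood divided by the total. A quick R sketch, using arbitrary illustrative values, confirms the two forms agree.

```r
lambda <- 0.4; mu <- 0.15; x <- 5; t_x <- 20; T_cal <- 52

# Form used in the generated quantities block of the Stan model
p_alive_gq <- 1 / (1 + mu / (mu + lambda) *
                     (exp((lambda + mu) * (T_cal - t_x)) - 1))

# Ratio of the 'alive' likelihood component to the total likelihood
lik_alive <- lambda^x * exp(-(lambda + mu) * T_cal)
lik_total <- lambda^x * mu / (lambda + mu) * exp(-(lambda + mu) * t_x) +
             lambda^(x + 1) / (lambda + mu) * exp(-(lambda + mu) * T_cal)
p_alive_ratio <- lik_alive / lik_total

abs(p_alive_gq - p_alive_ratio)  # zero up to floating-point error
```

Dividing numerator and denominator of the ratio by the ‘alive’ component recovers the compact form used in the Stan code.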
We now compile this model using CmdStanR.
pnbd_fixed_stanmodel <- cmdstan_model(
"stan_code/pnbd_fixed.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)

We then use this compiled model with our data to produce a fit of the data.
stan_modelname <- "pnbd_init_short_fixed1"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed <- stanfit_seed + 1
stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")
stan_data_lst <- customer_fit_stats_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 1.00,
mu_mn = 0.10,
mu_cv = 1.00,
)
if(!file_exists(stanfit_object_file)) {
pnbd_init_short_fixed1_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = stanfit_seed,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)
pnbd_init_short_fixed1_stanfit$save_object(stanfit_object_file, compress = "gzip")
} else {
pnbd_init_short_fixed1_stanfit <- read_rds(stanfit_object_file)
}

Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 finished in 22.5 seconds.
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 finished in 22.8 seconds.
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 23.2 seconds.
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 finished in 23.3 seconds.
All 4 chains finished successfully.
Mean chain execution time: 23.0 seconds.
Total execution time: 23.6 seconds.
pnbd_init_short_fixed1_stanfit$summary()

# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -1.62e+4 -1.62e+4 36.7 37.5 -1.62e+4 -1.61e+4 1.01 613.
2 lambda[1] 7.47e-1 7.46e-1 0.0837 0.0826 6.15e-1 8.91e-1 1.00 3629.
3 lambda[2] 3.55e-1 3.28e-1 0.165 0.155 1.34e-1 6.64e-1 1.01 3610.
4 lambda[3] 1.36e-1 7.52e-2 0.167 0.0854 3.94e-3 4.86e-1 1.00 2294.
5 lambda[4] 4.05e-1 3.48e-1 0.250 0.206 1.02e-1 9.09e-1 1.00 3762.
6 lambda[5] 1.39e-1 1.31e-1 0.0534 0.0531 6.57e-2 2.36e-1 1.00 4754.
7 lambda[6] 1.40e-1 1.10e-1 0.113 0.0901 1.96e-2 3.54e-1 1.00 3387.
8 lambda[7] 4.24e-1 3.95e-1 0.170 0.163 1.84e-1 7.38e-1 1.00 5217.
9 lambda[8] 3.13e-1 2.92e-1 0.146 0.139 1.14e-1 5.97e-1 1.00 3261.
10 lambda[9] 4.54e-1 3.94e-1 0.284 0.254 1.09e-1 1.00e+0 1.00 3485.
# ℹ 2,991 more rows
# ℹ 1 more variable: ess_tail <num>
We have some basic HMC-based validity statistics we can check.
pnbd_init_short_fixed1_stanfit$cmdstan_diagnose()

Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_init_short_fixed1_stanfit$draws(inc_warmup = TRUE) |>
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Lambda and Mu Values")

As the warmup is skewing the y-axis somewhat, we repeat this process without the warmup.
pnbd_init_short_fixed1_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))

A common MCMC diagnostic is \(\hat{R}\), a measure of the ‘similarity’ of the chains.
pnbd_init_short_fixed1_stanfit |>
rhat(pars = c("lambda", "mu")) |>
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")

Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
pnbd_init_short_fixed1_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")

Finally, we also want to look at autocorrelation in the chains for each parameter.
pnbd_init_short_fixed1_stanfit$draws() |>
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Values")

As before, this first fit has a comprehensive run of fit diagnostics, but for the sake of brevity in later models we will show only the traceplots once we are satisfied with the validity of the sample.
As we are still working with synthetic data, we know the true values for each customer and so we can check how good our model is at recovering the true values on a customer-by-customer basis.
As in previous workbooks, we build our validation datasets and then check the distribution of \(q\)-values for both \(\lambda\) and \(\mu\) across the customer base.
pnbd_init_short_fixed1_valid_lst <- create_pnbd_posterior_validation_data(
stanfit = pnbd_init_short_fixed1_stanfit,
data_tbl = customer_fit_stats_tbl,
simparams_tbl = customer_simparams_tbl
)
pnbd_init_short_fixed1_valid_lst$lambda_qval_plot |> plot()
pnbd_init_short_fixed1_valid_lst$mu_qval_plot |> plot()

These plots suggest the model is recovering the parameters well, but we cannot rely on this approach once we use real data, so we will stop using it now.
Rather than relying on knowing the ‘true’ answer, we will instead use our posterior sample to generate data and compare this simulated data against the data we fit. This procedure is similar to what we did before, but now we focus on in-sample data rather than validation data.
pnbd_stanfit <- pnbd_init_short_fixed1_stanfit |>
recover_types(customer_fit_stats_tbl)
pnbd_init_short_fixed1_simstats_tbl <- construct_pnbd_posterior_statistics(
stanfit = pnbd_stanfit,
fitdata_tbl = customer_fit_stats_tbl
)
pnbd_init_short_fixed1_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 6
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "…
$ first_tnx_date <dttm> 2020-01-01 10:27:01, 2020-01-01 10:27:01, 2020-01-01 1…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ post_lambda <dbl> 0.676106, 0.807568, 0.898712, 0.614051, 0.664696, 0.818…
$ post_mu <dbl> 0.000203323, 0.000437640, 0.002158650, 0.003038690, 0.0…
$ p_alive <dbl> 0.999941, 0.999870, 0.999353, 0.999125, 0.994186, 0.998…
We now want to write out the simulation stats to disk.
pnbd_init_short_fixed1_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed1_assess_model_simstats_tbl.rds", compress = "gz")

We then use these posterior statistics as inputs to our simulations to help us assess the in-sample quality of fit.
fit_label <- "pnbd_init_short_fixed1"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)

[1] TRUE
pnbd_init_short_fixed1_fitsims_index_tbl <- pnbd_init_short_fixed1_simstats_tbl |>
mutate(
start_dttm = first_tnx_date |> as.POSIXct(),
end_dttm = use_fit_end_date |> as.POSIXct(),
lambda = post_lambda,
mu = post_mu,
p_alive = 1, ### In-sample validation, so customer begins active
tnx_mu = 100, ### We are not simulating tnx size, so put in defaults
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed1_fitsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_fit_pnbd_init_sho…
We now use this setup to generate our simulations.
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed1_fitsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed1_fitsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed1_fitsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_fit_pnbd_init_sho…
We now want to load up the summary statistics for each of our customers for later analysis.
retrieve_sim_stats <- ~ .x |>
read_rds() |>
select(draw_id, sim_data, sim_tnx_count, sim_tnx_last)

pnbd_init_short_fixed1_fit_simstats_tbl <- pnbd_init_short_fixed1_fitsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed1_fit"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed1_fit_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[54 x 2]>], [<tbl_df[93 x 2]>], [<tbl_df[74 x 2…
$ sim_tnx_count <int> 54, 93, 74, 4, 83, 63, 95, 34, 81, 76, 86, 0, 54, 30, 75…
$ sim_tnx_last <dttm> 2021-12-22 05:14:44, 2021-12-18 19:07:18, 2021-06-04 18…
We now use this data to check how well our model fits the data.
We start by checking the high-level summary statistics, such as the number of customers with more than one transaction in both the observed and the simulated data, and the total observed transaction count.
obs_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed1_fit_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)

The observed count of customers with at least one additional transaction after the first is captured by this simulation.
We now check the count of all transactions.
obs_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed1_fit_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)

As before, these all look good. Our model is doing a good job capturing the data.
We now look at the quantiles for the transaction counts across each customer.
obs_quantiles_tbl <- customer_fit_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed1_fit_simstats_tbl |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

We write this data to disk.
pnbd_init_short_fixed1_fit_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed1_assess_fit_simstats_tbl.rds", compress = "gz")

We now repeat this exercise, but for the validation period of 2022.
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)

[1] TRUE
use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date |> as.POSIXct()
pnbd_init_short_fixed1_validsims_index_tbl <- pnbd_init_short_fixed1_simstats_tbl |>
mutate(
start_dttm = use_start,
end_dttm = use_final,
lambda = post_lambda,
mu = post_mu,
tnx_mu = 1, ### We are not simulating tnx size
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed1_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_valid_pnbd_init_s…
We now can run these simulations to check how well our model captures transactions out of the fitted data.
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed1_validsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed1_validsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed1_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_valid_pnbd_init_s…
Now that we have generated our simulations we want to load the data from the files and construct a dataset for use as part of the validation.
pnbd_init_short_fixed1_valid_simstats_tbl <- pnbd_init_short_fixed1_validsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed1_valid"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed1_valid_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[33 x 2]>], [<tbl_df[43 x 2]>], [<tbl_df[37 x 2…
$ sim_tnx_count <int> 33, 43, 37, 27, 35, 46, 52, 52, 43, 31, 13, 6, 37, 11, 4…
$ sim_tnx_last <dttm> 2022-12-31 14:26:42, 2022-12-22 01:07:43, 2022-12-22 06…
We start by checking the high-level summary statistics, such as the number of customers with more than one transaction in both the observed and the simulated data, and the total observed transaction count.
obs_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed1_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)

We now check the count of all transactions.
obs_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed1_valid_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)

We now look at the quantiles for the transaction counts across each customer.
obs_quantiles_tbl <- customer_valid_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed1_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)
We write this data to disk.
pnbd_init_short_fixed1_valid_simstats_tbl |>
  write_rds("data/pnbd_init_short_fixed1_assess_valid_simstats_tbl.rds", compress = "gz")
rm(pnbd_init_short_fixed1_fit_simstats_tbl)
rm(pnbd_init_short_fixed1_valid_simstats_tbl)
All of this looks good: our model appears to do a good job of recreating the data both in and out of sample, though this is not surprising since we mostly used the same input parameters.
For real-life work we will not have this benefit, so it is worth checking how sensitive this process is to the choice of prior parameters.
syslog(
glue("Creating the second P/NBD model (lower CoV)"),
level = "INFO"
)
We want to try an alternate prior model with a smaller coefficient of variation to see what impact it has on our procedures.
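The priors here are specified by a mean and a coefficient of variation (CoV); the helper `gamma_mucv2shaperate` referenced later in the simulation globals presumably performs the conversion to the shape/rate parameterisation used internally. A minimal sketch of that mapping, assuming the standard Gamma identities:

```r
# Convert a (mean, CoV) specification to Gamma shape/rate.
# For Gamma(shape = k, rate = r): mean = k / r and cv = 1 / sqrt(k),
# so k = 1 / cv^2 and r = k / mean.
mucv_to_shaperate <- function(mu, cv) {
  shape <- 1 / cv^2
  rate  <- shape / mu
  c(shape = shape, rate = rate)
}

mucv_to_shaperate(mu = 0.25, cv = 0.50)  # lambda prior: shape 4, rate 16
mucv_to_shaperate(mu = 0.10, cv = 0.50)  # mu prior:     shape 4, rate 40
```

Halving the CoV quadruples the shape parameter, which is why a CoV of 0.50 concentrates the prior so much more tightly around its mean than the earlier specification.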
stan_modelname <- "pnbd_init_short_fixed2"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed <- stanfit_seed + 1
stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")
stan_data_lst <- customer_fit_stats_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 0.50,
mu_mn = 0.10,
mu_cv = 0.50
)
if(!file_exists(stanfit_object_file)) {
pnbd_init_short_fixed2_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = stanfit_seed,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
pnbd_init_short_fixed2_stanfit$save_object(stanfit_object_file, compress = "gzip")
} else {
pnbd_init_short_fixed2_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 18.7 seconds.
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 finished in 19.4 seconds.
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 finished in 19.3 seconds.
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 finished in 19.5 seconds.
All 4 chains finished successfully.
Mean chain execution time: 19.2 seconds.
Total execution time: 20.5 seconds.
pnbd_init_short_fixed2_stanfit$summary()
# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -3.37e+4 -3.37e+4 32.2 32.1 -3.38e+4 -3.37e+4 1.01 689.
2 lambda[1] 6.99e-1 6.96e-1 0.0726 0.0729 5.84e-1 8.26e-1 1.00 4644.
3 lambda[2] 3.06e-1 2.92e-1 0.113 0.109 1.50e-1 5.10e-1 1.00 4378.
4 lambda[3] 2.11e-1 1.91e-1 0.106 0.101 7.22e-2 4.12e-1 1.01 4406.
5 lambda[4] 3.05e-1 2.86e-1 0.129 0.127 1.27e-1 5.50e-1 1.00 3863.
6 lambda[5] 1.64e-1 1.58e-1 0.0508 0.0493 9.06e-2 2.54e-1 1.01 5930.
7 lambda[6] 1.96e-1 1.81e-1 0.0906 0.0876 7.39e-2 3.69e-1 1.00 3952.
8 lambda[7] 3.48e-1 3.32e-1 0.116 0.114 1.86e-1 5.52e-1 1.01 3763.
9 lambda[8] 2.84e-1 2.70e-1 0.106 0.0996 1.31e-1 4.81e-1 1.00 4745.
10 lambda[9] 3.09e-1 2.91e-1 0.130 0.121 1.35e-1 5.43e-1 1.00 4127.
# ℹ 2,991 more rows
# ℹ 1 more variable: ess_tail <num>
We have some basic HMC-based validity statistics we can check.
pnbd_init_short_fixed2_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_init_short_fixed2_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
  theme(axis.text.x = element_text(size = 10))
We want to check the \(N_{eff}\) statistics also.
pnbd_init_short_fixed2_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
mcmc_neff() +
  ggtitle("Plot of Parameter Effective Sample Sizes")
We now want to check our in-sample data.
pnbd_stanfit <- pnbd_init_short_fixed2_stanfit |>
recover_types(customer_fit_stats_tbl)
pnbd_init_short_fixed2_simstats_tbl <- construct_pnbd_posterior_statistics(
stanfit = pnbd_stanfit,
fitdata_tbl = customer_fit_stats_tbl
)
pnbd_init_short_fixed2_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 6
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "…
$ first_tnx_date <dttm> 2020-01-01 10:27:01, 2020-01-01 10:27:01, 2020-01-01 1…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ post_lambda <dbl> 0.626380, 0.773684, 0.651973, 0.754594, 0.667604, 0.723…
$ post_mu <dbl> 0.00611598, 0.03853470, 0.01895990, 0.03860240, 0.01920…
$ p_alive <dbl> 0.998236, 0.988717, 0.994523, 0.988726, 0.994441, 0.987…
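The `p_alive` column is the posterior probability that a customer is still active at the end of the observation period. Conditional on the rates \((\lambda, \mu)\) and the summary \((x, t_x, T)\), the P/NBD model gives this in closed form; a sketch of that calculation, assuming this is what `construct_pnbd_posterior_statistics` computes for each posterior draw:

```r
# P(alive | lambda, mu, x, t_x, T) for the P/NBD model. The customer was
# certainly alive at t_x (the last observed purchase); the alternative to
# still being alive at T is dropping out somewhere in (t_x, T] with no
# further purchases, which yields the odds term below.
pnbd_p_alive <- function(lambda, mu, t_x, T_cal) {
  1 / (1 + (mu / (lambda + mu)) * expm1((lambda + mu) * (T_cal - t_x)))
}

# Example with values on the scale of the posterior draws above
pnbd_p_alive(lambda = 0.63, mu = 0.006, t_x = 102, T_cal = 104)
```

The `expm1()` call computes `exp(x) - 1` stably; note that the longer the gap `T_cal - t_x` since the last purchase, the smaller this probability becomes.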
We write out this data to disk.
#| echo: true
pnbd_init_short_fixed2_simstats_tbl |>
  write_rds("data/pnbd_init_short_fixed2_assess_model_simstats_tbl.rds", compress = "gz")
We now want to generate the transactions for the in-sample data.
fit_label <- "pnbd_init_short_fixed2"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)
[1] TRUE
pnbd_init_short_fixed2_fitsims_index_tbl <- pnbd_init_short_fixed2_simstats_tbl |>
mutate(
start_dttm = first_tnx_date |> as.POSIXct(),
end_dttm = use_fit_end_date |> as.POSIXct(),
lambda = post_lambda,
mu = post_mu,
p_alive = 1, ### In-sample validation, so customer begins active
tnx_mu = 100, ### We are not simulating tnx size, so put in defaults
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed2_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_fit_pnbd_init_sho…
We now use this setup to generate our simulations.
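Under the hood, each posterior draw is turned into a simulated transaction history by first drawing a dropout time from an exponential distribution with rate \(\mu\), and then generating Poisson arrivals at rate \(\lambda\) up to the earlier of dropout and the end of the window. A minimal sketch of that generative step, under the assumption that this is essentially what `generate_pnbd_validation_transactions` does (the real function also handles transaction sizes and the `p_alive` flag):

```r
# Simulate repeat-transaction times for one customer and one posterior
# draw, over an observation window of length T_total (same time units
# as lambda and mu, e.g. weeks).
simulate_pnbd_customer <- function(lambda, mu, T_total) {
  tau     <- rexp(1, rate = mu)      # latent dropout time
  horizon <- min(tau, T_total)       # purchasing stops at dropout
  n       <- rpois(1, lambda * horizon)  # Poisson count while active
  # Given the count, Poisson arrival times are uniform order statistics
  sort(runif(n, min = 0, max = horizon))
}

set.seed(42)
simulate_pnbd_customer(lambda = 0.7, mu = 0.01, T_total = 104)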
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed2_fitsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed2_fitsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 422
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed2_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_fit_pnbd_init_sho…
We now want to load up the summary statistics for each of our customers for later analysis.
pnbd_init_short_fixed2_fit_simstats_tbl <- pnbd_init_short_fixed2_fitsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed2_fit"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed2_fit_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[62 x 2]>], [<tbl_df[5 x 2]>], [<tbl_df[68 x 2]…
$ sim_tnx_count <int> 62, 5, 68, 47, 53, 27, 33, 9, 0, 5, 16, 35, 43, 11, 0, 3…
$ sim_tnx_last <dttm> 2021-12-24 19:24:33, 2020-02-23 11:21:50, 2021-12-30 04…
We now use this data to check how well our model fits the data.
We start by checking the high-level summary statistics: the count of customers with transactions in the period, in both the observed and the simulated data, and the total observed transaction count.
obs_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed2_fit_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)
The observed count of customers with at least one additional transaction after the first is well captured by the simulation.
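The visual comparison above can also be summarised numerically as a posterior predictive p-value: the proportion of simulated draws at or above the observed value. A quick sketch, assuming `sim_data_tbl` and `obs_customer_count` as constructed above — values very close to 0 or 1 would indicate miscalibration:

```r
# Posterior predictive p-value for the multi-transaction customer count
sim_data_tbl |>
  summarise(ppc_pvalue = mean(sim_customer_count >= obs_customer_count))
```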
We now check the count of all transactions.
obs_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed2_fit_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)
As before, these all look good: our model is doing a good job of capturing the data.
We now look at the quantiles for the transaction counts across each customer.
obs_quantiles_tbl <- customer_fit_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed2_fit_simstats_tbl |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)
We write this data to disk.
pnbd_init_short_fixed2_fit_simstats_tbl |>
  write_rds("data/pnbd_init_short_fixed2_assess_fit_simstats_tbl.rds", compress = "gz")
We now repeat this exercise, but for the validation period of 2022.
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)
[1] TRUE
use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date |> as.POSIXct()
pnbd_init_short_fixed2_validsims_index_tbl <- pnbd_init_short_fixed2_simstats_tbl |>
mutate(
start_dttm = use_start,
end_dttm = use_final,
lambda = post_lambda,
mu = post_mu,
tnx_mu = 1, ### We are not simulating tnx size
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed2_validsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_valid_pnbd_init_s…
We can now run these simulations to check how well our model captures transactions outside the fitted period.
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed2_validsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed2_validsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 422
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed2_validsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_valid_pnbd_init_s…
Now that we have generated our simulations we want to load the data from the files and construct a dataset for use as part of the validation.
pnbd_init_short_fixed2_valid_simstats_tbl <- pnbd_init_short_fixed2_validsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed2_valid"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed2_valid_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[31 x 2]>], [<tbl_df[2 x 2]>], [<tbl_df[45 x 2]…
$ sim_tnx_count <int> 31, 2, 45, 0, 36, 4, 8, 37, 6, 28, 6, 9, 39, 49, 30, 10,…
$ sim_tnx_last <dttm> 2022-12-18 22:32:20, 2022-01-07 20:21:37, 2022-12-30 11…
We start by checking the high-level summary statistics: the count of customers with transactions in the period, in both the observed and the simulated data, and the total observed transaction count.
obs_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed2_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)
We now check the count of all transactions.
obs_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed2_valid_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)
We now look at the quantiles of the transaction counts across customers.
obs_quantiles_tbl <- customer_valid_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed2_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)
We write this data to disk.
pnbd_init_short_fixed2_valid_simstats_tbl |>
  write_rds("data/pnbd_init_short_fixed2_assess_valid_simstats_tbl.rds", compress = "gz")
rm(pnbd_init_short_fixed2_fit_simstats_tbl)
rm(pnbd_init_short_fixed2_valid_simstats_tbl)
This model appears to struggle to properly account for the transaction frequency, as the distribution of transaction counts is biased to the low side.
This may be due to the compressed nature of the Gamma priors resulting from the smaller coefficient of variation. As a final model, we try wider priors to see how they perform.
syslog(
glue("Creating the third P/NBD model (higher CoV)"),
level = "INFO"
)
We want to try an alternate prior model with a larger coefficient of variation to see what impact it has on our procedures.
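Before fitting, it is worth seeing how much a CoV of 2.00 widens the Gamma prior relative to the CoV of 0.50 used in the previous model at the same mean. A quick sketch comparing the central 90% prior intervals for \(\lambda\) (mean 0.25), using the shape = 1/cv², rate = shape/mean parameterisation:

```r
# Central 90% interval of a Gamma prior specified by mean and CoV
gamma_interval <- function(mu, cv, probs = c(0.05, 0.95)) {
  shape <- 1 / cv^2
  qgamma(probs, shape = shape, rate = shape / mu)
}

gamma_interval(mu = 0.25, cv = 0.50)  # tight prior from the second model
gamma_interval(mu = 0.25, cv = 2.00)  # heavier-tailed prior used here
```

The wider prior piles much more mass near zero while also allowing far larger rates, so we should expect noticeably more dispersed per-customer `lambda` posteriors, particularly for customers with few transactions.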
stan_modelname <- "pnbd_init_short_fixed3"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed <- stanfit_seed + 1
stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")
stan_data_lst <- customer_fit_stats_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 2.00,
mu_mn = 0.10,
mu_cv = 2.00
)
if(!file_exists(stanfit_object_file)) {
pnbd_init_short_fixed3_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = stanfit_seed,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
pnbd_init_short_fixed3_stanfit$save_object(stanfit_object_file, compress = "gzip")
} else {
pnbd_init_short_fixed3_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 finished in 49.8 seconds.
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 finished in 50.0 seconds.
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 finished in 50.9 seconds.
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 52.6 seconds.
All 4 chains finished successfully.
Mean chain execution time: 50.8 seconds.
Total execution time: 52.9 seconds.
pnbd_init_short_fixed3_stanfit$summary()
# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -1.17e+4 -1.17e+4 39.0 3.77e+1 -1.17e+4 -1.16e+4 1.01 585.
2 lambda[1] 7.66e-1 7.63e-1 0.0896 8.54e-2 6.22e-1 9.17e-1 1.00 3741.
3 lambda[2] 3.92e-1 3.51e-1 0.206 1.92e-1 1.26e-1 7.74e-1 1.00 3720.
4 lambda[3] 5.37e-2 2.60e-3 0.186 3.85e-3 1.21e-7 2.71e-1 1.00 1295.
5 lambda[4] 5.78e-1 4.67e-1 0.442 3.63e-1 9.36e-2 1.44e+0 0.999 2931.
6 lambda[5] 1.28e-1 1.20e-1 0.0531 4.85e-2 5.27e-2 2.27e-1 1.00 2399.
7 lambda[6] 9.48e-2 5.35e-2 0.117 6.39e-2 3.41e-3 3.20e-1 1.00 1238.
8 lambda[7] 4.68e-1 4.41e-1 0.195 1.92e-1 2.07e-1 8.13e-1 1.00 3493.
9 lambda[8] 3.32e-1 3.01e-1 0.178 1.63e-1 9.87e-2 6.59e-1 1.00 2958.
10 lambda[9] 7.75e-1 6.41e-1 0.565 4.76e-1 1.32e-1 1.88e+0 1.00 3442.
# ℹ 2,991 more rows
# ℹ 1 more variable: ess_tail <num>
We have some basic HMC-based validity statistics we can check.
pnbd_init_short_fixed3_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
The following parameters had split R-hat greater than 1.05:
p_alive[76], p_alive[399], p_alive[696]
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizating your model with additional prior information or a more effective parameterization.
Processing complete.
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_init_short_fixed3_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
  theme(axis.text.x = element_text(size = 10))
We want to check the \(N_{eff}\) statistics also.
pnbd_init_short_fixed3_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
mcmc_neff() +
  ggtitle("Plot of Parameter Effective Sample Sizes")
We now want to check our in-sample data.
pnbd_stanfit <- pnbd_init_short_fixed3_stanfit |>
recover_types(customer_fit_stats_tbl)
pnbd_init_short_fixed3_simstats_tbl <- construct_pnbd_posterior_statistics(
stanfit = pnbd_stanfit,
fitdata_tbl = customer_fit_stats_tbl
)
pnbd_init_short_fixed3_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 6
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "…
$ first_tnx_date <dttm> 2020-01-01 10:27:01, 2020-01-01 10:27:01, 2020-01-01 1…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ post_lambda <dbl> 0.729477, 0.763174, 0.745776, 0.781277, 0.809089, 0.746…
$ post_mu <dbl> 8.96362e-06, 1.59510e-02, 4.09596e-04, 6.52414e-03, 9.2…
$ p_alive <dbl> 0.999997, 0.995319, 0.999880, 0.998078, 0.999973, 0.999…
We write out this data to disk.
#| echo: true
pnbd_init_short_fixed3_simstats_tbl |>
  write_rds("data/pnbd_init_short_fixed3_assess_model_simstats_tbl.rds", compress = "gz")
We now want to generate the transactions for the in-sample data.
fit_label <- "pnbd_init_short_fixed3"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)
[1] TRUE
pnbd_init_short_fixed3_fitsims_index_tbl <- pnbd_init_short_fixed3_simstats_tbl |>
mutate(
start_dttm = first_tnx_date |> as.POSIXct(),
end_dttm = use_fit_end_date |> as.POSIXct(),
lambda = post_lambda,
mu = post_mu,
p_alive = 1, ### In-sample validation, so customer begins active
tnx_mu = 100, ### We are not simulating tnx size, so put in defaults
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed3_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_fit_pnbd_init_sho…
We now use this setup to generate our simulations.
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed3_fitsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed3_fitsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 423
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed3_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_fit_pnbd_init_sho…
We now want to load up the summary statistics for each of our customers for later analysis.
pnbd_init_short_fixed3_fit_simstats_tbl <- pnbd_init_short_fixed3_fitsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed3_fit"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed3_fit_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[71 x 2]>], [<tbl_df[10 x 2]>], [<tbl_df[76 x 2…
$ sim_tnx_count <int> 71, 10, 76, 3, 81, 72, 78, 101, 70, 76, 76, 76, 73, 93, …
$ sim_tnx_last <dttm> 2021-12-31 18:32:47, 2020-03-03 09:45:52, 2021-12-31 11…
We now use this data to check how well our model fits the data.
We start by checking the high-level summary statistics: the count of customers with transactions in the period, in both the observed and the simulated data, and the total observed transaction count.
obs_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed3_fit_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)
The observed count of customers with at least one additional transaction after the first is well captured by the simulation.
We now check the count of all transactions.
obs_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed3_fit_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)
As before, these all look good: our model is doing a good job of capturing the data.
We now look at the quantiles for the transaction counts across each customer.
obs_quantiles_tbl <- customer_fit_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed3_fit_simstats_tbl |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

We now write this data to disk.
pnbd_init_short_fixed3_fit_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed3_assess_fit_simstats_tbl.rds", compress = "gz")

We now repeat this exercise, but for the validation period of 2022.
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)

[1] TRUE
use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date |> as.POSIXct()
pnbd_init_short_fixed3_validsims_index_tbl <- pnbd_init_short_fixed3_simstats_tbl |>
mutate(
start_dttm = use_start,
end_dttm = use_final,
lambda = post_lambda,
mu = post_mu,
tnx_mu = 1, ### We are not simulating tnx size
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed3_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_valid_pnbd_init_s…
We can now run these simulations to check how well our model captures transactions outside the fitted period.
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed3_validsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed3_validsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 423
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed3_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0016", "SFC202001_0027", "SFC202001_0094", "SFC…
$ cust_params <list<tibble[,12]>> [<tbl_df[2000 x 12]>], [<tbl_df[2000 x 12]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_valid_pnbd_init_s…
Now that we have generated our simulations, we load the data from the files and construct a dataset for use in the validation checks.
pnbd_init_short_fixed3_valid_simstats_tbl <- pnbd_init_short_fixed3_validsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed3_valid"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed3_valid_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0016", "SFC202001_0016", "SFC202001_0016", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[38 x 2]>], [<tbl_df[4 x 2]>], [<tbl_df[43 x 2]…
$ sim_tnx_count <int> 38, 4, 43, 9, 42, 40, 34, 48, 46, 44, 40, 35, 29, 54, 15…
$ sim_tnx_last <dttm> 2022-12-19 09:02:22, 2022-02-28 15:44:52, 2022-12-26 16…
We start by checking the high-level summary statistics, such as the count of customers with more than one transaction and the total transaction count, in both the observed and the simulated data.
obs_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed3_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)

We now check the count of all transactions.
obs_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed3_valid_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)

We now look at the quantiles of the transaction counts across customers.
obs_quantiles_tbl <- customer_valid_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed3_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

We now write this data to disk.
pnbd_init_short_fixed3_valid_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed3_assess_valid_simstats_tbl.rds", compress = "gz")

rm(pnbd_init_short_fixed3_fit_simstats_tbl)
rm(pnbd_init_short_fixed3_valid_simstats_tbl)

The behaviour of this model is a little curious: in-sample, the model under-estimates the count of repeat customers, but out-of-sample it is fine.
At the same time, the model over-estimates the total transaction count both in-sample and out-of-sample.
Similarly, the higher quantiles of the transaction counts are over-estimated.
This suggests the model tends towards higher transaction frequency but lower lifetime values. This should be a useful point of comparison when fitting on real data.
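To quantify this tendency rather than just eyeball it, one simple diagnostic is the relative bias of the posterior-predictive mean against the observed value. A base-R sketch with made-up numbers standing in for `obs_total_count` and the per-draw `sim_total_count` values:

```r
# Hypothetical observed total and simulated totals (one per posterior draw)
obs_total <- 4100
sim_totals <- c(4310, 4250, 4400, 4190, 4480, 4330, 4270, 4360)

# Relative bias of the posterior-predictive mean: positive values mean
# the model over-estimates the total transaction count.
rel_bias <- (mean(sim_totals) - obs_total) / obs_total
round(100 * rel_bias, 1)  # percentage bias: 5.5 for these made-up numbers
```

The same calculation applied to the repeat-customer counts and the quantile summaries gives a compact table of where each model runs hot or cold.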
We have looked at each of the models individually, but it is also worth looking at each of the models as a group.
calculate_simulation_statistics <- function(file_rds) {
simdata_tbl <- read_rds(file_rds)
multicount_cust_tbl <- simdata_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "multicust_count")
totaltnx_data_tbl <- simdata_tbl |>
count(draw_id, wt = sim_tnx_count, name = "simtnx_count")
simstats_tbl <- multicount_cust_tbl |>
inner_join(totaltnx_data_tbl, by = "draw_id")
return(simstats_tbl)
}

obs_fit_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
obs_valid_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
obs_fit_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
obs_valid_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
obs_stats_tbl <- tribble(
~assess_type, ~name, ~obs_value,
"fit", "multicust_count", obs_fit_customer_count,
"fit", "simtnx_count", obs_fit_total_count,
"valid", "multicust_count", obs_valid_customer_count,
"valid", "simtnx_count", obs_valid_total_count
)
model_assess_tbl <- dir_ls("data", regexp = "pnbd_init_short_fixed.*_assess_") |>
enframe(name = NULL, value = "file_path") |>
filter(str_detect(file_path, "_assess_model_", negate = TRUE)) |>
mutate(
model_label = str_replace(file_path, "data/pnbd_init_short_(fixed.*?)_assess_.*", "\\1"),
assess_type = if_else(str_detect(file_path, "_assess_fit_"), "fit", "valid"),
sim_data = map(file_path, calculate_simulation_statistics)
)
model_assess_tbl |> glimpse()

Rows: 6
Columns: 4
$ file_path <fs::path> "data/pnbd_init_short_fixed1_assess_fit_simstats_tbl.…
$ model_label <chr> "fixed1", "fixed1", "fixed2", "fixed2", "fixed3", "fixed3"
$ assess_type <chr> "fit", "valid", "fit", "valid", "fit", "valid"
$ sim_data <list> [<tbl_df[2000 x 3]>], [<tbl_df[2000 x 3]>], [<tbl_df[2000 …
We now take this data and calculate summary statistics based on it.
model_assess_summstat_tbl <- model_assess_tbl |>
select(model_label, assess_type, sim_data) |>
unnest(sim_data) |>
pivot_longer(
cols = !c(model_label, assess_type, draw_id)
) |>
group_by(model_label, assess_type, name) |>
summarise(
.groups = "drop",
mean_val = mean(value),
p10 = quantile(value, 0.10),
p25 = quantile(value, 0.25),
p50 = quantile(value, 0.50),
p75 = quantile(value, 0.75),
p90 = quantile(value, 0.90)
)
model_assess_summstat_tbl |> glimpse()

Rows: 12
Columns: 9
$ model_label <chr> "fixed1", "fixed1", "fixed1", "fixed1", "fixed2", "fixed2"…
$ assess_type <chr> "fit", "fit", "valid", "valid", "fit", "fit", "valid", "va…
$ name <chr> "multicust_count", "simtnx_count", "multicust_count", "sim…
$ mean_val <dbl> 619.8015, 4314.2750, 178.1055, 1822.4220, 681.0555, 3200.6…
$ p10 <dbl> 601.0, 4045.9, 168.0, 1655.0, 664.0, 3010.9, 127.0, 822.9,…
$ p25 <dbl> 611, 4168, 173, 1737, 672, 3093, 132, 871, 525, 4858, 525,…
$ p50 <dbl> 620.0, 4310.0, 178.0, 1820.5, 681.0, 3196.0, 137.0, 931.0,…
$ p75 <dbl> 629, 4457, 183, 1908, 690, 3301, 141, 993, 540, 5183, 540,…
$ p90 <dbl> 637.0, 4593.1, 188.0, 1984.1, 698.0, 3397.0, 146.0, 1046.0…
#| echo: TRUE
ggplot(model_assess_summstat_tbl) +
geom_errorbar(
aes(x = model_label, ymin = p10, ymax = p90), width = 0
) +
geom_errorbar(
aes(x = model_label, ymin = p25, ymax = p75), width = 0, linewidth = 3
) +
geom_hline(
aes(yintercept = obs_value),
data = obs_stats_tbl, colour = "red"
) +
scale_y_continuous(labels = label_comma()) +
expand_limits(y = 0) +
facet_wrap(
vars(assess_type, name), scales = "free_y"
) +
labs(
x = "Model",
y = "Count",
title = "Comparison Plot for the Different Models"
)

We now save the assessment data to disk.
model_assess_tbl |> write_rds("data/assess_data_pnbd_init_short_fixed_tbl.rds")

options(width = 120L)
sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.3 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Dublin
date 2023-11-06
pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] RSPM (R 4.3.0)
arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.3.0)
backports 1.4.1 2021-12-13 [1] RSPM (R 4.3.0)
base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.3.0)
bayesplot * 1.10.0 2022-11-16 [1] RSPM (R 4.3.0)
bit 4.0.5 2022-11-15 [1] RSPM (R 4.3.0)
bit64 4.0.5 2020-08-30 [1] RSPM (R 4.3.0)
bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.3.0)
brms * 2.20.4 2023-09-25 [1] RSPM (R 4.3.0)
Brobdingnag 1.2-9 2022-10-19 [1] RSPM (R 4.3.0)
cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0)
callr 3.7.3 2022-11-02 [1] RSPM (R 4.3.0)
checkmate 2.3.0 2023-10-25 [1] RSPM (R 4.3.0)
cli 3.6.1 2023-03-23 [1] RSPM (R 4.3.0)
cmdstanr * 0.6.0.9000 2023-11-02 [1] Github (stan-dev/cmdstanr@a13c798)
coda 0.19-4 2020-09-30 [1] RSPM (R 4.3.0)
codetools 0.2-19 2023-02-01 [2] CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 [1] RSPM (R 4.3.0)
colourpicker 1.3.0 2023-08-21 [1] RSPM (R 4.3.0)
conflicted * 1.2.0 2023-02-01 [1] RSPM (R 4.3.0)
cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] RSPM (R 4.3.0)
crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.3.0)
curl 5.1.0 2023-10-02 [1] RSPM (R 4.3.0)
data.table 1.14.8 2023-02-17 [1] RSPM (R 4.3.0)
digest 0.6.33 2023-07-07 [1] RSPM (R 4.3.0)
directlabels * 2023.8.25 2023-09-01 [1] RSPM (R 4.3.0)
distributional 0.3.2 2023-03-22 [1] RSPM (R 4.3.0)
dplyr * 1.1.3 2023-09-03 [1] RSPM (R 4.3.0)
DT 0.30 2023-10-05 [1] RSPM (R 4.3.0)
dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0)
evaluate 0.22 2023-09-29 [1] RSPM (R 4.3.0)
fansi 1.0.5 2023-10-08 [1] RSPM (R 4.3.0)
farver 2.1.1 2022-07-06 [1] RSPM (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] RSPM (R 4.3.0)
fs * 1.6.3 2023-07-20 [1] RSPM (R 4.3.0)
furrr * 0.3.1 2022-08-15 [1] RSPM (R 4.3.0)
future * 1.33.0 2023-07-01 [1] RSPM (R 4.3.0)
generics 0.1.3 2022-07-05 [1] RSPM (R 4.3.0)
ggdist 3.3.0 2023-05-13 [1] RSPM (R 4.3.0)
ggplot2 * 3.4.4 2023-10-12 [1] RSPM (R 4.3.0)
globals 0.16.2 2022-11-21 [1] RSPM (R 4.3.0)
glue * 1.6.2 2022-02-24 [1] RSPM (R 4.3.0)
gridExtra 2.3 2017-09-09 [1] RSPM (R 4.3.0)
gtable 0.3.4 2023-08-21 [1] RSPM (R 4.3.0)
gtools 3.9.4 2022-11-27 [1] RSPM (R 4.3.0)
hms 1.1.3 2023-03-21 [1] RSPM (R 4.3.0)
htmltools 0.5.6.1 2023-10-06 [1] RSPM (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] RSPM (R 4.3.0)
httpuv 1.6.12 2023-10-23 [1] RSPM (R 4.3.0)
igraph 1.5.1 2023-08-10 [1] RSPM (R 4.3.0)
inline 0.3.19 2021-05-31 [1] RSPM (R 4.3.0)
jsonlite 1.8.7 2023-06-29 [1] RSPM (R 4.3.0)
knitr 1.44 2023-09-11 [1] RSPM (R 4.3.0)
labeling 0.4.3 2023-08-29 [1] RSPM (R 4.3.0)
later 1.3.1 2023-05-02 [1] RSPM (R 4.3.0)
lattice 0.21-8 2023-04-05 [2] CRAN (R 4.3.1)
lifecycle 1.0.3 2022-10-07 [1] RSPM (R 4.3.0)
listenv 0.9.0 2022-12-16 [1] RSPM (R 4.3.0)
loo 2.6.0 2023-03-31 [1] RSPM (R 4.3.0)
lubridate * 1.9.3 2023-09-27 [1] RSPM (R 4.3.0)
magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.3.0)
markdown 1.11 2023-10-19 [1] RSPM (R 4.3.0)
Matrix 1.5-4.1 2023-05-18 [2] CRAN (R 4.3.1)
matrixStats 1.0.0 2023-06-02 [1] RSPM (R 4.3.0)
memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0)
mime 0.12 2021-09-28 [1] RSPM (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] RSPM (R 4.3.0)
mvtnorm 1.2-3 2023-08-25 [1] RSPM (R 4.3.0)
nlme 3.1-162 2023-01-31 [2] CRAN (R 4.3.1)
parallelly 1.36.0 2023-05-26 [1] RSPM (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] RSPM (R 4.3.0)
pkgbuild 1.4.2 2023-06-26 [1] RSPM (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.3.0)
plyr 1.8.9 2023-10-02 [1] RSPM (R 4.3.0)
posterior * 1.4.1 2023-03-14 [1] RSPM (R 4.3.0)
prettyunits 1.2.0 2023-09-24 [1] RSPM (R 4.3.0)
processx 3.8.2 2023-06-30 [1] RSPM (R 4.3.0)
promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0)
ps 1.7.5 2023-04-18 [1] RSPM (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] RSPM (R 4.3.0)
quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.3.0)
QuickJSR 1.0.7 2023-10-15 [1] RSPM (R 4.3.0)
R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0)
Rcpp * 1.0.11 2023-07-06 [1] RSPM (R 4.3.0)
RcppParallel 5.1.7 2023-02-27 [1] RSPM (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] RSPM (R 4.3.0)
reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.3.0)
rlang * 1.1.1 2023-04-28 [1] RSPM (R 4.3.0)
rmarkdown 2.25 2023-09-18 [1] RSPM (R 4.3.0)
rstan 2.32.3 2023-10-15 [1] RSPM (R 4.3.0)
rstantools 2.3.1.1 2023-07-18 [1] RSPM (R 4.3.0)
rstudioapi 0.15.0 2023-07-07 [1] RSPM (R 4.3.0)
rsyslog * 1.0.3 2023-05-08 [1] RSPM (R 4.3.0)
scales * 1.2.1 2022-08-20 [1] RSPM (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0)
shiny 1.7.5.1 2023-10-14 [1] RSPM (R 4.3.0)
shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.3.0)
shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.3.0)
shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.3.0)
StanHeaders 2.26.28 2023-09-07 [1] RSPM (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] RSPM (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] RSPM (R 4.3.0)
svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.3.0)
tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.3.0)
threejs 0.3.3 2020-01-21 [1] RSPM (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] RSPM (R 4.3.0)
tidybayes * 3.0.6 2023-08-12 [1] RSPM (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] RSPM (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] RSPM (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] RSPM (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] RSPM (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] RSPM (R 4.3.0)
utf8 1.2.4 2023-10-22 [1] RSPM (R 4.3.0)
V8 4.4.0 2023-10-09 [1] RSPM (R 4.3.0)
vctrs 0.6.4 2023-10-12 [1] RSPM (R 4.3.0)
vroom 1.6.4 2023-10-02 [1] RSPM (R 4.3.0)
withr 2.5.1 2023-09-26 [1] RSPM (R 4.3.0)
xfun 0.40 2023-08-09 [1] RSPM (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0)
xts 0.13.1 2023-04-16 [1] RSPM (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] RSPM (R 4.3.0)
zoo 1.8-12 2023-04-13 [1] RSPM (R 4.3.0)
[1] /usr/local/lib/R/site-library
[2] /usr/local/lib/R/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
options(width = 80L)